1. To look for the root cause - the dataset is in factor type which is status not a number.
  2. To find out the probabilistic relationship between the symptom error code and the resolution
  3. Most of the dataset is in factor format which status / state.
  4. Thus, with this structure we can learn and predicts the outcome by learning the relationship between the elements.

My caption.

1

Below is the list of literature review regarding Bayesian Net for root cause analysis :-

Citation Method Output Conclusions
TroubleMiner: Mining network trouble tickets Medem, A. ; Akodjenou, M.-I ; Teixeira, R. 20091 Trouble tickets classification * Automation process on clustering the free text inside the description of the trouble tickets. * Choosing the correct keywords for the analysis Using term frequency distance between trouble tickets and similarity between clusters
Knowledge Discovery from Trouble Ticketing Reports in a Large Telecommunication Company Temprado, Y. ; Garcia, C. ; Molinero, F.J. 2009 Data Mining , Text Mining and Machine Learning , Bayes Net, Naïve Bayes Prediction on the next action of trouble tickets ,Different snapshots were added to the machine learning algorithm for training Combination of multiple method to construct the recommendation , Using Bayesian for prediction
A Bayesian Approach To Stochastic Root Finding 2011 in progress in progress in progress
A Fully Bayesian Approach For Unit Root Testing 2011 in progress in progress in progress
Online Root-Cause Analysis Of Alarms In Discrete Bayesian 2014 in progress in progress in progress
Documents Categorization Based On Bayesian Spanning Tree 2006 in progress in progress in progress
Benefits of a Bayesian Approach to Anomaly and Failure 2009 in progress in progress in progress

My caption.

Process on gathering the dataset

Rules Description
status = ‘Closed’ Dataset must be closed for complete information
network_tt_id is NULL Dataset must be not related to Network Trouble Ticket
trouble ticket type <> PASSIVE Trouble Ticket must related to the Active elements such as routers, switches , modem , etc
installed_date is NOT NULL This field must have value
created_date is NOT NULL This field must have value
closed_date is NOT NULL This field must have value
closed_date is NOT NULL This field must have value
product is NOT NULL This field must have value
sub_product is NOT NULL This field must have value
length description > 10 This field is useful for text analysis
rand() Record selection is in random mode
zone Should selective from different zone , sparse

For sample purpose - selecting dataset from ZONE KEPONG for the analysis due to this zone has the highest records inside the Trouble Ticket dataset.

Documentation - https://github.com/piersharding/dplyrimpaladb

install.packages(c("RJDBC", "devtools", "dplyr"))
devtools::install_github("jwills/dplyrimpaladb")
install.packages("dplyrimpaladb")
  1. Cloudera ‘Impala’, which is a massively parallel processing (MPP) SQL query engine runs natively in Apache Hadoop
  2. Impala’s Place in the Big Data Ecosystem
  3. Flexibility for Big Data Workflow
  4. High-Performance Analytics

Connection to Impala

Basic Impala drivers can be downloaded from https://github.com/Mu-Sigma/RImpala/blob/master/impala-jdbc-cdh5.zip

Below is the components required and how to set the class path for the Impala drivers , RJava , RJDBC and dplyr

setwd("D:/Google Drive/PHD/Progress/phdprogress1")
suppressWarnings(suppressMessages(library("rJava")))
suppressWarnings(suppressMessages(library("RJDBC")))
suppressWarnings(suppressMessages(library("dplyr")))
suppressWarnings(suppressMessages(library("caret")))
suppressWarnings(suppressMessages(library("corrplot")))
suppressWarnings(suppressMessages(library("lazy")))
suppressWarnings(suppressMessages(library("dplyrimpaladb")))
suppressWarnings(suppressMessages(library("rpart")))
suppressWarnings(suppressMessages(library("DiagrammeR")))
suppressWarnings(suppressMessages(library("klaR")))
suppressWarnings(suppressMessages(library("corrplot")))
suppressWarnings(suppressMessages(library("readxl")))
suppressWarnings(suppressMessages(library("tools")))
suppressWarnings(suppressMessages(library("Rgraphviz")))
suppressWarnings(suppressMessages(library("SnowballC")))
suppressWarnings(suppressMessages(library("tm")))


#add the class path for Impala jar 
.jaddClassPath(c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T)))
#add the class path for JVM
.jinit(classpath = c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T)))
#add the class path for dplyr library
dplyr.jdbc.classpath = c(list.files(paste(getwd(),"/lib",sep = ''),pattern="jar$",full.names=T))
conn <- src_impaladb(dbname='nova', host='10.54.1.151')
## Loading required package: testthat
## [1] "here:"
## [1] FALSE
result <-  tbl(conn, sql("select zone from nova.nova_trouble_ticket where zone <> 'null' group by zone order by zone limit 1000"))
as.data.frame(result)
##                          zone
## 1               ZONE AIR ITAM
## 2                  ZONE BANGI
## 3                ZONE BANGSAR
## 4                ZONE BANTING
## 5                   ZONE BATU
## 6             ZONE BATU PAHAT
## 7             ZONE BAYAN BARU
## 8                ZONE BINTULU
## 9         ZONE BUKIT ANGGERIK
## 10        ZONE BUKIT MERTAJAM
## 11            ZONE BUKIT RAJA
## 12           ZONE BUTTERWORTH
## 13             ZONE CYBERJAYA
## 14                ZONE GOMBAK
## 15                  ZONE IPOH
## 16                ZONE KAJANG
## 17                ZONE KEPONG
## 18               ZONE KERAMAT
## 19               ZONE KINRARA
## 20            ZONE KL CENTRAL
## 21                 ZONE KLANG
## 22 ZONE KOTA KINABALU SELATAN
## 23   ZONE KOTA KINABALU UTARA
## 24               ZONE KUCHING
## 25                 ZONE KULIM
## 26              ZONE LANGKAWI
## 27                ZONE MALURI
## 28          ZONE MELAKA UTARA
## 29                  ZONE MIRI
## 30     ZONE N. SEMBILAN UTARA
## 31                ZONE PANDAN
## 32               ZONE PELANGI
## 33                ZONE PERLIS
## 34         ZONE PETALING JAYA
## 35               ZONE PUCHONG
## 36         ZONE SEBERANG JAYA
## 37                 ZONE SENAI
## 38             ZONE SG PETANI
## 39             ZONE SHAH ALAM
## 40                  ZONE SIBU
## 41        ZONE SKUDAI PONTIAN
## 42               ZONE STAMPIN
## 43           ZONE SUBANG JAYA
## 44        ZONE TAMAN PETALING
## 45                ZONE TAMPOI
## 46                   ZONE TAR
## 47                 ZONE TASEK
## 48          ZONE TASIK AMPANG
## 49                   ZONE TDI
## 50           ZONE TELUK INTAN
## 51    ZONE TERENGGANU SELATAN
## 52              ZONE TERUNTUM

So , i can replace the SQL like '%ZONE __________ %' with the zone list above and apply limit 100 for each group of the zone. Each group combined via UNION ALL operation.

Example below :-

My caption.

result <-  tbl(conn, sql("select * from nova_trouble_ticket where zone <> 'null' limit 1"))
as.data.frame(apply(as.data.frame(result),2,class))
##                        apply(as.data.frame(result), 2, class)
## tt_row_id                                           character
## tt_num                                              character
## tt_type                                             character
## tt_sub_type                                         character
## status                                              character
## severity                                            character
## important_message                                   character
## appointment_flag                                    character
## nova_account_name                                   character
## nova_subscriber_num                                 character
## nova_account_num                                    character
## package_row_id                                      character
## created_by                                          character
## category                                            character
## symptom_error_code                                  character
## priority                                            character
## product                                             character
## sub_product                                         character
## package_name                                        character
## network_tt_id                                       character
## swap_order_num                                      character
## cause_category                                      character
## cause_code                                          character
## resolution_code                                     character
## closure_category                                    character
## resolution_team                                     character
## service_affected                                    character
## service_order_num                                   character
## btu_type                                            character
## owner                                               character
## owner_name                                          character
## group_owner                                         character
## owner_position                                      character
## btu_platform                                        character
## dp_location                                         character
## created_date                                        character
## pending_verify_date                                 character
## closed_by                                           character
## closed_date                                         character
## source                                              character
## installed_date                                      character
## description                                         character
## repeat_ticket_count                                 character
## follow_up_ticket_count                              character
## fdp_device_name                                     character
## fdp_site_name                                       character
## olt_site_name                                       character
## exchange                                            character
## timestamp                                           character
## contact_id                                          character
## contact_name                                        character
## contact_office_phone                                character
## contact_mobile_phone                                character
## contact_home_phone                                  character
## contact_email_addr                                  character
## due_date                                            character
## part_num                                            character
## network_layer                                       character
## network_row_id                                      character
## asset_id                                            character
## ptt                                                 character
## zone                                                character
## service_point_id                                    character

Getting the dataset from Impala

Sample dataset - Selection trouble tickets only from Zone Kepong. The SQL is define by :-

Zone Kepong contains very rich information especially for the textual analysis and also one of the largest composition of the cause code & the resolution code which is good for the supervised learning.

Rules Description
a.status like ‘%Closed%’ Dataset must be closed for complete information
network_tt_id = ‘null’ Dataset must be not related to Network Trouble Ticket
trouble ticket type <> PASSIVE Trouble Ticket must related to the Active elements such as routers, switches , modem , etc. Excluding for now if related to the 3rd party causes , customer behavior and Passive elements
installed_date is NOT NULL This field must have value
created_date is NOT NULL This field must have value
closed_date is NOT NULL This field must have value
closed_date is NOT NULL This field must have value
product is NOT NULL This field must have value
sub_product is NOT NULL This field must have value
length description > 10 This field is useful for text analysis
rand() Record selection is in random mode
zone Should selective from different zone , sparse control

Generated SQL :-

select * from nova_trouble_ticket a join active_code b on (trim(a.cause_code) = trim(b.cause_code)) join exchange_zone c ON (trim(a.exchange)=trim(c.building_id)) and (b.code <> 'PASSIVE' ) where c.zone_name like '%ZONE KEPONG%' and a.status like '%Closed%'  and length(a.cause_category) > 1  and length(a.created_date) > 6 and length(a.closed_date) > 6 and length(a.installed_date) > 6 and a.package_name not like '%null%' and a.product not like '%null%' and a.sub_product not like '%null%'  and  length(a.description) > 10 and network_tt_id = 'null' order by rand() limit 10000 "

Dataset filtering

Removing non-related fields such as trouble ticket key , trouble ticket number , trouble ticket date etc.

conn <- src_impaladb(dbname='nova', host='10.54.1.151')
## [1] "here:"
## [1] FALSE
result <-  tbl(conn, sql("select a.tt_row_id,a.tt_num,a.tt_type,a.tt_sub_type,a.status,a.severity,a.important_message,a.appointment_flag,a.nova_account_name,a.nova_subscriber_num,a.nova_account_num,a.package_row_id,a.created_by,a.category,a.symptom_error_code,a.priority,a.product,a.sub_product,a.package_name,a.network_tt_id,a.swap_order_num,a.cause_category,a.cause_code,a.resolution_code,a.closure_category,a.resolution_team,a.service_affected,a.service_order_num,a.btu_type,a.owner,a.owner_name,a.group_owner,a.owner_position,a.btu_platform,a.dp_location,a.created_date,a.pending_verify_date,a.closed_by,a.closed_date,a.source,a.installed_date,a.description,a.repeat_ticket_count,a.follow_up_ticket_count,a.fdp_device_name,
a.fdp_site_name,a.olt_site_name,a.exchange,a.`timestamp`,a.contact_id,a.contact_name,a.contact_office_phone,a.contact_mobile_phone,a.contact_home_phone,a.contact_email_addr,a.due_date,a.part_num,a.network_layer,a.network_row_id,a.asset_id,a.ptt,a.zone,a.service_point_id , c.zone_name, c.district,c.state, c.region from nova_trouble_ticket a join active_code b on (trim(a.cause_code) = trim(b.cause_code)) join exchange_zone c ON (trim(a.exchange)=trim(c.building_id)) and (b.code <> 'PASSIVE' ) where c.zone_name like '%ZONE KEPONG%' and a.status like '%Closed%' and  length(a.cause_category) > 1 and length(a.created_date) > 6 and length(a.closed_date) > 6 and length(a.installed_date) > 6 and a.package_name not like '%null%' and a.product not like '%null%' and a.sub_product not like '%null%' and  length(a.description) > 10 and a.network_tt_id = 'null' order by rand() limit 100"))

result <- as.data.frame(result)

Close the connection from Impala

x <- conn$con
class(x) <- c('JDBCConnection')
dbDisconnect(x)
## [1] TRUE

Save the class as the data.frame

df <- as.data.frame(result)
df_tm <- as.data.frame(result)
df$contact_name <- NULL
df$contact_home_phone <- NULL
df$contact_email_addr <- NULL
df$contact_office_phone <- NULL
df$contact_mobile_phone <- NULL
df$`tt_row_id` <- NULL
df$`tt_num` <- NULL
df$tt_type <- NULL
df$`created_date` <- NULL
df$`closed_date` <- NULL
df$`installed_date` <- NULL
df$timestamp <- NULL
df$service_point_id <- NULL
df$contact_id <- NULL
df$owner_position <- NULL
df$tt_sub_type <- NULL
df$severity <- NULL
df$status <- NULL
df$important_message <- NULL
df$network_tt_id <- NULL
df$swap_order_num <- NULL
df$appointment_flag <- NULL
df$nova_account_name <- NULL
df$nova_subscriber_num <- NULL
df$nova_account_num <- NULL
df$repeat_ticket_count <- NULL
df$follow_up_ticket_count <- NULL
df$service_order_num <- NULL
df$source <- NULL
df$owner_name <- NULL
df$description <- NULL
df$due_date <- NULL
df$part_num <- NULL
df$zone <- NULL
df$ptt <- NULL
df$asset_id <- NULL
df$network_layer <- NULL
df$network_row_id <- NULL
df$pending_verify_date <- NULL
df$package_row_id <- NULL
df$priority <- NULL
summary(df)
##   created_by          category         symptom_error_code
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    product          sub_product        package_name      
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  cause_category      cause_code        resolution_code   
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  closure_category   resolution_team    service_affected  
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    btu_type            owner           group_owner       
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  btu_platform       dp_location         closed_by        
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  fdp_device_name    fdp_site_name      olt_site_name     
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##    exchange          zone_name           district        
##  Length:100         Length:100         Length:100        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     state              region         
##  Length:100         Length:100        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character

Looping the columns name and rename it to [column name]+1 as the factor name

for(i in names(df)){

  num <- as.numeric(as.factor(df[,i]))
  df <- cbind(df,num)
  names(df)[names(df)=="num"] <- paste(names(df[i]),"_factor",sep = "")
  print(paste(names(df[i]),"1",sep = ""))
}
## [1] "created_by1"
## [1] "category1"
## [1] "symptom_error_code1"
## [1] "product1"
## [1] "sub_product1"
## [1] "package_name1"
## [1] "cause_category1"
## [1] "cause_code1"
## [1] "resolution_code1"
## [1] "closure_category1"
## [1] "resolution_team1"
## [1] "service_affected1"
## [1] "btu_type1"
## [1] "owner1"
## [1] "group_owner1"
## [1] "btu_platform1"
## [1] "dp_location1"
## [1] "closed_by1"
## [1] "fdp_device_name1"
## [1] "fdp_site_name1"
## [1] "olt_site_name1"
## [1] "exchange1"
## [1] "zone_name1"
## [1] "district1"
## [1] "state1"
## [1] "region1"
df <- df[27:52]
names(df)
##  [1] "created_by_factor"         "category_factor"          
##  [3] "symptom_error_code_factor" "product_factor"           
##  [5] "sub_product_factor"        "package_name_factor"      
##  [7] "cause_category_factor"     "cause_code_factor"        
##  [9] "resolution_code_factor"    "closure_category_factor"  
## [11] "resolution_team_factor"    "service_affected_factor"  
## [13] "btu_type_factor"           "owner_factor"             
## [15] "group_owner_factor"        "btu_platform_factor"      
## [17] "dp_location_factor"        "closed_by_factor"         
## [19] "fdp_device_name_factor"    "fdp_site_name_factor"     
## [21] "olt_site_name_factor"      "exchange_factor"          
## [23] "zone_name_factor"          "district_factor"          
## [25] "state_factor"              "region_factor"
write.csv(df,"ctt2014_small.csv")

Remove the predictors column which might have one unique value which can leads to zero variance result

The list below is non-zero variance variables

df <- df[,-nearZeroVar(df)] 
names(df)
##  [1] "created_by_factor"         "category_factor"          
##  [3] "symptom_error_code_factor" "product_factor"           
##  [5] "sub_product_factor"        "package_name_factor"      
##  [7] "cause_category_factor"     "cause_code_factor"        
##  [9] "resolution_code_factor"    "closure_category_factor"  
## [11] "resolution_team_factor"    "service_affected_factor"  
## [13] "btu_type_factor"           "owner_factor"             
## [15] "group_owner_factor"        "btu_platform_factor"      
## [17] "dp_location_factor"        "closed_by_factor"         
## [19] "fdp_device_name_factor"    "fdp_site_name_factor"     
## [21] "olt_site_name_factor"      "exchange_factor"

Finding Correlation

Find the correlation between the variables using Pearson.

correlations <- cor(df, use="pairwise.complete.obs", method="pearson")
print(correlations)
##                           created_by_factor category_factor
## created_by_factor                1.00000000    -0.072387248
## category_factor                 -0.07238725     1.000000000
## symptom_error_code_factor       -0.08267702     0.198715902
## product_factor                   0.08388301    -0.044170248
## sub_product_factor               0.16617625    -0.087225808
## package_name_factor              0.06271467    -0.104937159
## cause_category_factor            0.08414186     0.053469506
## cause_code_factor                0.06781792     0.072915126
## resolution_code_factor          -0.01756382    -0.040664360
## closure_category_factor          0.05410537     0.103440224
## resolution_team_factor           0.09611271     0.214571696
## service_affected_factor         -0.10211774    -0.294634711
## btu_type_factor                 -0.08708122    -0.014813596
## owner_factor                     0.43534980     0.046706493
## group_owner_factor               0.01388123     0.049131139
## btu_platform_factor             -0.03920239    -0.095018399
## dp_location_factor               0.01464646     0.045696501
## closed_by_factor                 0.43534980     0.046706493
## fdp_device_name_factor          -0.03085109     0.034875988
## fdp_site_name_factor            -0.03085109     0.034875988
## olt_site_name_factor            -0.01589588     0.006246004
## exchange_factor                  0.03843539     0.025839067
##                           symptom_error_code_factor product_factor
## created_by_factor                      -0.082677023     0.08388301
## category_factor                         0.198715902    -0.04417025
## symptom_error_code_factor               1.000000000    -0.10651881
## product_factor                         -0.106518810     1.00000000
## sub_product_factor                     -0.048073165     0.48755359
## package_name_factor                     0.015390256     0.77553621
## cause_category_factor                   0.158916426     0.24381425
## cause_code_factor                       0.136987911     0.06411406
## resolution_code_factor                  0.062227127     0.28855270
## closure_category_factor                 0.005429765    -0.03349907
## resolution_team_factor                  0.101226596    -0.37627774
## service_affected_factor                -0.317962939     0.23751093
## btu_type_factor                        -0.079434062     0.17131098
## owner_factor                            0.005034200    -0.02332783
## group_owner_factor                      0.106279328    -0.04467218
## btu_platform_factor                    -0.031844469     0.10889355
## dp_location_factor                      0.139383400    -0.03037303
## closed_by_factor                        0.005034200    -0.02332783
## fdp_device_name_factor                 -0.073277163    -0.02430041
## fdp_site_name_factor                   -0.073277163    -0.02430041
## olt_site_name_factor                   -0.094243710    -0.04138103
## exchange_factor                        -0.051399544    -0.01354695
##                           sub_product_factor package_name_factor
## created_by_factor                 0.16617625         0.062714675
## category_factor                  -0.08722581        -0.104937159
## symptom_error_code_factor        -0.04807316         0.015390256
## product_factor                    0.48755359         0.775536214
## sub_product_factor                1.00000000         0.555410598
## package_name_factor               0.55541060         1.000000000
## cause_category_factor             0.09546730         0.146596285
## cause_code_factor                 0.21221574         0.093513229
## resolution_code_factor            0.11958663         0.168443826
## closure_category_factor          -0.01441844        -0.061428430
## resolution_team_factor           -0.18675861        -0.295632612
## service_affected_factor           0.24294257         0.238479552
## btu_type_factor                   0.08918906         0.014639816
## owner_factor                      0.03674080         0.011394674
## group_owner_factor               -0.03216436         0.059038196
## btu_platform_factor               0.08599936        -0.003912656
## dp_location_factor                0.04786905         0.068812495
## closed_by_factor                  0.03674080         0.011394674
## fdp_device_name_factor           -0.09447411        -0.164135482
## fdp_site_name_factor             -0.09447411        -0.164135482
## olt_site_name_factor             -0.10305364        -0.202252716
## exchange_factor                  -0.10932264        -0.122840263
##                           cause_category_factor cause_code_factor
## created_by_factor                    0.08414186        0.06781792
## category_factor                      0.05346951        0.07291513
## symptom_error_code_factor            0.15891643        0.13698791
## product_factor                       0.24381425        0.06411406
## sub_product_factor                   0.09546730        0.21221574
## package_name_factor                  0.14659629        0.09351323
## cause_category_factor                1.00000000        0.25932650
## cause_code_factor                    0.25932650        1.00000000
## resolution_code_factor               0.46757026        0.15279525
## closure_category_factor              0.13686841       -0.11892871
## resolution_team_factor              -0.27080474        0.14141374
## service_affected_factor             -0.01872455        0.01142710
## btu_type_factor                      0.06395550       -0.05762476
## owner_factor                        -0.16523259       -0.09099995
## group_owner_factor                  -0.13980399       -0.12887935
## btu_platform_factor                  0.06837131       -0.07330032
## dp_location_factor                   0.02200759        0.17676213
## closed_by_factor                    -0.16523259       -0.09099995
## fdp_device_name_factor              -0.02086297       -0.28877825
## fdp_site_name_factor                -0.02086297       -0.28877825
## olt_site_name_factor                 0.01073973       -0.30538266
## exchange_factor                     -0.01147051       -0.32079033
##                           resolution_code_factor closure_category_factor
## created_by_factor                    -0.01756382             0.054105367
## category_factor                      -0.04066436             0.103440224
## symptom_error_code_factor             0.06222713             0.005429765
## product_factor                        0.28855270            -0.033499075
## sub_product_factor                    0.11958663            -0.014418442
## package_name_factor                   0.16844383            -0.061428430
## cause_category_factor                 0.46757026             0.136868408
## cause_code_factor                     0.15279525            -0.118928712
## resolution_code_factor                1.00000000             0.144926338
## closure_category_factor               0.14492634             1.000000000
## resolution_team_factor               -0.41490201            -0.054166643
## service_affected_factor              -0.01306866             0.104292947
## btu_type_factor                       0.33161471             0.054083570
## owner_factor                         -0.18849713            -0.108159548
## group_owner_factor                   -0.24600919            -0.097840859
## btu_platform_factor                   0.32934291             0.073878259
## dp_location_factor                   -0.24313781            -0.078841974
## closed_by_factor                     -0.18849713            -0.108159548
## fdp_device_name_factor                0.22182602             0.027996173
## fdp_site_name_factor                  0.22182602             0.027996173
## olt_site_name_factor                  0.25645181             0.054411002
## exchange_factor                       0.15430579             0.015553757
##                           resolution_team_factor service_affected_factor
## created_by_factor                     0.09611271             -0.10211774
## category_factor                       0.21457170             -0.29463471
## symptom_error_code_factor             0.10122660             -0.31796294
## product_factor                       -0.37627774              0.23751093
## sub_product_factor                   -0.18675861              0.24294257
## package_name_factor                  -0.29563261              0.23847955
## cause_category_factor                -0.27080474             -0.01872455
## cause_code_factor                     0.14141374              0.01142710
## resolution_code_factor               -0.41490201             -0.01306866
## closure_category_factor              -0.05416664              0.10429295
## resolution_team_factor                1.00000000             -0.13246490
## service_affected_factor              -0.13246490              1.00000000
## btu_type_factor                      -0.25822644              0.06582288
## owner_factor                          0.27837402             -0.19032833
## group_owner_factor                   -0.01349542             -0.28777200
## btu_platform_factor                  -0.21900417              0.11727936
## dp_location_factor                    0.16254032             -0.09467019
## closed_by_factor                      0.27837402             -0.19032833
## fdp_device_name_factor               -0.11517543              0.05452589
## fdp_site_name_factor                 -0.11517543              0.05452589
## olt_site_name_factor                 -0.10949318              0.04547937
## exchange_factor                      -0.03080287              0.11172718
##                           btu_type_factor owner_factor group_owner_factor
## created_by_factor             -0.08708122   0.43534980         0.01388123
## category_factor               -0.01481360   0.04670649         0.04913114
## symptom_error_code_factor     -0.07943406   0.00503420         0.10627933
## product_factor                 0.17131098  -0.02332783        -0.04467218
## sub_product_factor             0.08918906   0.03674080        -0.03216436
## package_name_factor            0.01463982   0.01139467         0.05903820
## cause_category_factor          0.06395550  -0.16523259        -0.13980399
## cause_code_factor             -0.05762476  -0.09099995        -0.12887935
## resolution_code_factor         0.33161471  -0.18849713        -0.24600919
## closure_category_factor        0.05408357  -0.10815955        -0.09784086
## resolution_team_factor        -0.25822644   0.27837402        -0.01349542
## service_affected_factor        0.06582288  -0.19032833        -0.28777200
## btu_type_factor                1.00000000  -0.15504445        -0.12921915
## owner_factor                  -0.15504445   1.00000000         0.33030464
## group_owner_factor            -0.12921915   0.33030464         1.00000000
## btu_platform_factor            0.91632426  -0.18329232        -0.14965284
## dp_location_factor            -0.77537561   0.12603192         0.14968938
## closed_by_factor              -0.15504445   1.00000000         0.33030464
## fdp_device_name_factor         0.68689996  -0.12146031        -0.07982665
## fdp_site_name_factor           0.68689996  -0.12146031        -0.07982665
## olt_site_name_factor           0.67620677  -0.11011242        -0.09849533
## exchange_factor                0.39386981  -0.07144686        -0.09452025
##                           btu_platform_factor dp_location_factor
## created_by_factor                -0.039202385         0.01464646
## category_factor                  -0.095018399         0.04569650
## symptom_error_code_factor        -0.031844469         0.13938340
## product_factor                    0.108893546        -0.03037303
## sub_product_factor                0.085999357         0.04786905
## package_name_factor              -0.003912656         0.06881250
## cause_category_factor             0.068371308         0.02200759
## cause_code_factor                -0.073300324         0.17676213
## resolution_code_factor            0.329342909        -0.24313781
## closure_category_factor           0.073878259        -0.07884197
## resolution_team_factor           -0.219004166         0.16254032
## service_affected_factor           0.117279362        -0.09467019
## btu_type_factor                   0.916324258        -0.77537561
## owner_factor                     -0.183292319         0.12603192
## group_owner_factor               -0.149652836         0.14968938
## btu_platform_factor               1.000000000        -0.79164019
## dp_location_factor               -0.791640195         1.00000000
## closed_by_factor                 -0.183292319         0.12603192
## fdp_device_name_factor            0.668271342        -0.65681126
## fdp_site_name_factor              0.668271342        -0.65681126
## olt_site_name_factor              0.670017850        -0.62280188
## exchange_factor                   0.421064534        -0.37695976
##                           closed_by_factor fdp_device_name_factor
## created_by_factor               0.43534980            -0.03085109
## category_factor                 0.04670649             0.03487599
## symptom_error_code_factor       0.00503420            -0.07327716
## product_factor                 -0.02332783            -0.02430041
## sub_product_factor              0.03674080            -0.09447411
## package_name_factor             0.01139467            -0.16413548
## cause_category_factor          -0.16523259            -0.02086297
## cause_code_factor              -0.09099995            -0.28877825
## resolution_code_factor         -0.18849713             0.22182602
## closure_category_factor        -0.10815955             0.02799617
## resolution_team_factor          0.27837402            -0.11517543
## service_affected_factor        -0.19032833             0.05452589
## btu_type_factor                -0.15504445             0.68689996
## owner_factor                    1.00000000            -0.12146031
## group_owner_factor              0.33030464            -0.07982665
## btu_platform_factor            -0.18329232             0.66827134
## dp_location_factor              0.12603192            -0.65681126
## closed_by_factor                1.00000000            -0.12146031
## fdp_device_name_factor         -0.12146031             1.00000000
## fdp_site_name_factor           -0.12146031             1.00000000
## olt_site_name_factor           -0.11011242             0.97273379
## exchange_factor                -0.07144686             0.88203830
##                           fdp_site_name_factor olt_site_name_factor
## created_by_factor                  -0.03085109         -0.015895876
## category_factor                     0.03487599          0.006246004
## symptom_error_code_factor          -0.07327716         -0.094243710
## product_factor                     -0.02430041         -0.041381032
## sub_product_factor                 -0.09447411         -0.103053644
## package_name_factor                -0.16413548         -0.202252716
## cause_category_factor              -0.02086297          0.010739731
## cause_code_factor                  -0.28877825         -0.305382658
## resolution_code_factor              0.22182602          0.256451813
## closure_category_factor             0.02799617          0.054411002
## resolution_team_factor             -0.11517543         -0.109493184
## service_affected_factor             0.05452589          0.045479369
## btu_type_factor                     0.68689996          0.676206768
## owner_factor                       -0.12146031         -0.110112425
## group_owner_factor                 -0.07982665         -0.098495332
## btu_platform_factor                 0.66827134          0.670017850
## dp_location_factor                 -0.65681126         -0.622801882
## closed_by_factor                   -0.12146031         -0.110112425
## fdp_device_name_factor              1.00000000          0.972733794
## fdp_site_name_factor                1.00000000          0.972733794
## olt_site_name_factor                0.97273379          1.000000000
## exchange_factor                     0.88203830          0.868256058
##                           exchange_factor
## created_by_factor              0.03843539
## category_factor                0.02583907
## symptom_error_code_factor     -0.05139954
## product_factor                -0.01354695
## sub_product_factor            -0.10932264
## package_name_factor           -0.12284026
## cause_category_factor         -0.01147051
## cause_code_factor             -0.32079033
## resolution_code_factor         0.15430579
## closure_category_factor        0.01555376
## resolution_team_factor        -0.03080287
## service_affected_factor        0.11172718
## btu_type_factor                0.39386981
## owner_factor                  -0.07144686
## group_owner_factor            -0.09452025
## btu_platform_factor            0.42106453
## dp_location_factor            -0.37695976
## closed_by_factor              -0.07144686
## fdp_device_name_factor         0.88203830
## fdp_site_name_factor           0.88203830
## olt_site_name_factor           0.86825606
## exchange_factor                1.00000000

Find the highest correlated variables.

Rules Description
- +.70 or higher Very strong relationship
- +.40 to +.69 Strong positive relationship
- +.30 to +.39 Moderate relationship
- +.20 to +.29 weak relationship
- +.01 to +.19 No or negligible relationship
# Choose 0.7 Very strong relationship
highlyCorrelated <- findCorrelation(correlations, 0.7 ,verbose = FALSE,names = TRUE)
highlyCorrelated
## [1] "olt_site_name_factor"   "fdp_device_name_factor"
## [3] "fdp_site_name_factor"   "btu_platform_factor"   
## [5] "btu_type_factor"        "owner_factor"          
## [7] "package_name_factor"

Summary of the correlated variables.

summary(correlations)
##  created_by_factor  category_factor    symptom_error_code_factor
##  Min.   :-0.10212   Min.   :-0.29463   Min.   :-0.317963        
##  1st Qu.:-0.03085   1st Qu.:-0.04329   1st Qu.:-0.073277        
##  Median : 0.02654   Median : 0.03488   Median : 0.005034        
##  Mean   : 0.09427   Mean   : 0.05361   Mean   : 0.044360        
##  3rd Qu.: 0.08408   3rd Qu.: 0.05238   3rd Qu.: 0.105016        
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.000000        
##  product_factor     sub_product_factor package_name_factor
##  Min.   :-0.37628   Min.   :-0.18676   Min.   :-0.29563   
##  1st Qu.:-0.03272   1st Qu.:-0.07744   1st Qu.:-0.09406   
##  Median :-0.01844   Median : 0.04230   Median : 0.01502   
##  Mean   : 0.12161   Mean   : 0.10936   Mean   : 0.09555   
##  3rd Qu.: 0.22096   3rd Qu.: 0.15453   3rd Qu.: 0.13333   
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.00000   
##  cause_category_factor cause_code_factor  resolution_code_factor
##  Min.   :-0.27080      Min.   :-0.32079   Min.   :-0.41490      
##  1st Qu.:-0.02033      1st Qu.:-0.11195   1st Qu.:-0.03489      
##  Median : 0.05871      Median : 0.03777   Median : 0.14886      
##  Mean   : 0.09083      Mean   : 0.02840   Mean   : 0.11669      
##  3rd Qu.: 0.14416      3rd Qu.: 0.14031   3rd Qu.: 0.24780      
##  Max.   : 1.00000      Max.   : 1.00000   Max.   : 1.00000      
##  closure_category_factor resolution_team_factor service_affected_factor
##  Min.   :-0.11893        Min.   :-0.41490       Min.   :-0.31796       
##  1st Qu.:-0.05961        1st Qu.:-0.21094       1st Qu.:-0.12488       
##  Median : 0.02177        Median :-0.08183       Median : 0.02845       
##  Mean   : 0.05125        Mean   :-0.01453       Mean   : 0.02918       
##  3rd Qu.: 0.06901        3rd Qu.: 0.13137       3rd Qu.: 0.10987       
##  Max.   : 1.00000        Max.   : 1.00000       Max.   : 1.00000       
##  btu_type_factor     owner_factor      group_owner_factor
##  Min.   :-0.77538   Min.   :-0.19033   Min.   :-0.28777  
##  1st Qu.:-0.08517   1st Qu.:-0.12146   1st Qu.:-0.12128  
##  Median : 0.05902   Median :-0.04739   Median :-0.06225  
##  Mean   : 0.15632   Mean   : 0.07912   Mean   : 0.01893  
##  3rd Qu.: 0.37831   3rd Qu.: 0.10620   3rd Qu.: 0.05656  
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.00000  
##  btu_platform_factor dp_location_factor closed_by_factor  
##  Min.   :-0.79164    Min.   :-0.79164   Min.   :-0.19033  
##  1st Qu.:-0.08959    1st Qu.:-0.34350   1st Qu.:-0.12146  
##  Median : 0.07112    Median : 0.01833   Median :-0.04739  
##  Mean   : 0.15262    Mean   :-0.10218   Mean   : 0.07912  
##  3rd Qu.: 0.39813    3rd Qu.: 0.12603   3rd Qu.: 0.10620  
##  Max.   : 1.00000    Max.   : 1.00000   Max.   : 1.00000  
##  fdp_device_name_factor fdp_site_name_factor olt_site_name_factor
##  Min.   :-0.65681       Min.   :-0.65681     Min.   :-0.622802   
##  1st Qu.:-0.11000       1st Qu.:-0.11000     1st Qu.:-0.107883   
##  Median :-0.02258       Median :-0.02258     Median :-0.004825   
##  Mean   : 0.17081       Mean   : 0.17081     Mean   : 0.169093   
##  3rd Qu.: 0.55666       3rd Qu.: 0.55666     3rd Qu.: 0.566626   
##  Max.   : 1.00000       Max.   : 1.00000     Max.   : 1.000000   
##  exchange_factor    
##  Min.   :-0.376960  
##  1st Qu.:-0.071447  
##  Median : 0.002042  
##  Mean   : 0.159936  
##  3rd Qu.: 0.333979  
##  Max.   : 1.000000

Plot correlated variables.

png(height=1200, width=1200, pointsize=15, file="corrplot.png")
corrplot(correlations, method = "number",tl.cex = 0.9 ,addCoef.col="grey", order = "AOE")
dev.off()
## png 
##   2

My caption.

Feature selection

Feature selection process to confirm which variable does become the independent and resolution code is the dependent variable via GBM (Stochastic Gradient Boosting).

List of other available model - http://topepo.github.io/caret/modelList.html

set.seed(777)
suppressWarnings(suppressMessages(library(mlbench)))
control <-
trainControl(method = "repeatedcv", number = 5)
model <-
train(
resolution_code_factor ~ ., data = df, method = "gbm", preProcess = "scale", trControl =
control , verbose = FALSE
)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.2.2
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:graph':
## 
##     join
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
importance <- varImp(model, scale = TRUE)
print(importance)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                            Overall
## cause_code_factor         100.0000
## cause_category_factor      88.4202
## resolution_team_factor     42.1766
## fdp_device_name_factor     29.1509
## owner_factor               22.4480
## service_affected_factor    12.3133
## created_by_factor          10.8777
## btu_type_factor             9.3577
## group_owner_factor          7.5313
## dp_location_factor          4.9879
## btu_platform_factor         3.4670
## closure_category_factor     3.2488
## sub_product_factor          0.6527
## olt_site_name_factor        0.0000
## closed_by_factor            0.0000
## exchange_factor             0.0000
## package_name_factor         0.0000
## category_factor             0.0000
## symptom_error_code_factor   0.0000
## product_factor              0.0000
plot(importance)

My caption.

So far , the main variables or factors found are :-

as based on the list the importance plot after the tuning and fitting predictive model process.

My caption.

After the status = Closed inside the trouble ticket dataset.

My caption.

Still need to discuss about this. Why we need this dataset ?

My caption. My caption.

I still do not know when the best time to train the dataset. The dataset keep updating every day when :-

I recommend that the dataset need to train for every week because of the size and the velocity of the dataset.

The Bayes model might need to rebuild if the prediction level become lower of the treshold ; i.e < 60% accuracy

Basically below is how i acquire the dataset as mention during the previous proposal defense below :-

My caption.


And this is how the implementation will be in the future :-

My caption.

Prediction

Anyway , this is the basic on how i train the sample dataset ( only for Kepong Zone ) using NaiveBayes and make the prediction of the resolution code :-

col_names <- names(df)
df[,col_names] <- lapply(df[,col_names] , factor)
names(df)
##  [1] "created_by_factor"         "category_factor"          
##  [3] "symptom_error_code_factor" "product_factor"           
##  [5] "sub_product_factor"        "package_name_factor"      
##  [7] "cause_category_factor"     "cause_code_factor"        
##  [9] "resolution_code_factor"    "closure_category_factor"  
## [11] "resolution_team_factor"    "service_affected_factor"  
## [13] "btu_type_factor"           "owner_factor"             
## [15] "group_owner_factor"        "btu_platform_factor"      
## [17] "dp_location_factor"        "closed_by_factor"         
## [19] "fdp_device_name_factor"    "fdp_site_name_factor"     
## [21] "olt_site_name_factor"      "exchange_factor"
# split 70/30 from dataset into training and testing
trainIndex <- createDataPartition(df$resolution_code_factor, p=0.70, list=FALSE)
## Warning in createDataPartition(df$resolution_code_factor, p = 0.7, list
## = FALSE): Some classes have a single record ( 2, 4, 6, 7, 8, 9, 13 ) and
## these will be selected for the sample
data_train <- df[ trainIndex,]
data_test <- df[-trainIndex,]
# train a naive bayes model
model <- NaiveBayes(as.factor(resolution_code_factor)~., data=data_train)
# make predictions
predictions <- suppressWarnings(suppressMessages(predict(model, data_test)))
# summarize results by predicting resolution code using the test dataset -> data_test
cm <- suppressWarnings(suppressMessages(confusionMatrix(predictions$class, data_test$resolution_code_factor)))
cm 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
##         1  14  0  0  0  0  0  0  0  0  1  1  0  0  0  0
##         2   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         3   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         4   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         5   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         6   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         7   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         8   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         9   0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
##         10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         11  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##         14  0  0  0  0  0  0  0  0  0  0  0  0  0  4  0
##         15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8261          
##                  95% CI : (0.6122, 0.9505)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 0.05753         
##                                           
##                   Kappa : 0.6567          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.9333       NA       NA       NA       NA       NA
## Specificity            0.7500        1        1        1        1        1
## Pos Pred Value         0.8750       NA       NA       NA       NA       NA
## Neg Pred Value         0.8571       NA       NA       NA       NA       NA
## Prevalence             0.6522        0        0        0        0        0
## Detection Rate         0.6087        0        0        0        0        0
## Detection Prevalence   0.6957        0        0        0        0        0
## Balanced Accuracy      0.8417       NA       NA       NA       NA       NA
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity                NA       NA       NA   0.00000   0.00000
## Specificity                 1  0.95652  0.95652   1.00000   1.00000
## Pos Pred Value             NA       NA       NA       NaN       NaN
## Neg Pred Value             NA       NA       NA   0.95652   0.95652
## Prevalence                  0  0.00000  0.00000   0.04348   0.04348
## Detection Rate              0  0.00000  0.00000   0.00000   0.00000
## Detection Prevalence        0  0.04348  0.04348   0.00000   0.00000
## Balanced Accuracy          NA       NA       NA   0.50000   0.50000
##                      Class: 12 Class: 13 Class: 14 Class: 15
## Sensitivity                 NA        NA    0.8000   1.00000
## Specificity                  1         1    1.0000   1.00000
## Pos Pred Value              NA        NA    1.0000   1.00000
## Neg Pred Value              NA        NA    0.9474   1.00000
## Prevalence                   0         0    0.2174   0.04348
## Detection Rate               0         0    0.1739   0.04348
## Detection Prevalence         0         0    0.1739   0.04348
## Balanced Accuracy           NA        NA    0.9000   1.00000

Based on the dataset from Kepong zone , we manage to get the prediction level up to more than 80%

My caption.

So far , I’m only using the 70/30 data partition and also we also can consider to do the cross validattion sampling

Below is the option for 70/30 partition

trainIndex <- createDataPartition(df$resolution_code_factor, p=0.70, list=FALSE)

Below is the option for 10 folds of cross validation

trainControl(method = "repeatedcv", number = 10, repeats = 3)

My caption.

Mention here on how to clean and filter the data via the SQL statement using Impala

My caption.

The sample dataset have been converted into a factor and been applied here

Sample dataset after converted into factor which is required by the Bayes algorithm :-

head(df)
##   created_by_factor category_factor symptom_error_code_factor
## 1                48               1                         7
## 2                22               2                         3
## 3                29               1                         7
## 4                72               1                         7
## 5                14               1                         7
## 6                66               1                         7
##   product_factor sub_product_factor package_name_factor
## 1              2                 19                   8
## 2              3                 10                   6
## 3              1                  7                   3
## 4              3                 12                   8
## 5              3                 12                   8
## 6              3                 12                   8
##   cause_category_factor cause_code_factor resolution_code_factor
## 1                     3                 5                      1
## 2                     3                 7                      8
## 3                     4                 9                      6
## 4                     8                17                     11
## 5                     2                22                     15
## 6                     5                 2                     13
##   closure_category_factor resolution_team_factor service_affected_factor
## 1                       3                      2                       5
## 2                       3                     15                       1
## 3                       3                     15                       3
## 4                       1                     16                       1
## 5                       3                     15                       4
## 6                       3                     14                       3
##   btu_type_factor owner_factor group_owner_factor btu_platform_factor
## 1               3           10                  3                   4
## 2               3           15                  1                   2
## 3               2           29                  3                   2
## 4               3           39                  1                   4
## 5               3           30                  1                   4
## 6               3           14                  1                   4
##   dp_location_factor closed_by_factor fdp_device_name_factor
## 1                  1               10                     59
## 2                  1               15                     63
## 3                  5               29                     40
## 4                  2               39                     52
## 5                  1               30                     30
## 6                  1               14                     54
##   fdp_site_name_factor olt_site_name_factor exchange_factor
## 1                   59                   21               2
## 2                   63                   24               2
## 3                   40                   11               2
## 4                   52                   19               2
## 5                   30                   10               1
## 6                   54                   21               2

My caption.

Done that here and how to select the independent variables been mention here via the rank of the important factor

My caption.

I’m still working on the other method such as Random Forest , GBM and RPart. We need to discuss on this

My caption.

Tried before on how to do the feature selection via Generalized Boosted Regression here

My caption.

I’m thinking on clustering the cause code & the resolution code because of the unique count each of them is around

I also have some reference on the category of the cause code. So far the cause code can be divided into :-

So the clustering of the cause code & resolution is needed to reduce the size and become more manageble. So we can find the relationship among the variables.

My caption.

Need to disucss and I will do this later

My caption.

Need to discuss on this

My caption.

Need to discuss on this

Data Cleansing & Transformation

# workflow <- print(grViz("
# digraph neato  {
# 
# 
# ratio = \"fill\";
# size=\"8.3,11.7!\";
# margin=0;
# 
# # graph attributes
# graph [overlap = true, fontsize = 10]
# 
# node [shape = box,style = filled,
#         fontname = Arial,
#         color = darkslategray , fillcolor = orange]
# 
# # edge statements
# edge [color = grey]
# GET_Zone_Distinct_List -> RETURN_Zone_List -> LOOP_Zone_List -> GENERATE_SQL_filter_by_zone; DESIGN_Base_SQL ->
# APPLY_record_filter -> RECEIVE_Zone_Paramater -> GENERATE_SQL_filter_by_zone ->
# ORDER_records_in_random_mode -> Limit_records_per_zone -> Download_dataset -> Load_in_R->COVERT_dataset_as_dataframe->REMOVE_unrelated_fields->FETCH_sample_dataset->Split_70_30_for_training_and_testing->APPLY_feature_extraction->GET_related_field_for_training->APPLY_training_from_dataset->BUILD_model_from_training;
# Download_dataset-> SAVE_Trouble_Ticket_No_As_Reference;
# Download_dataset-> SAVE_Description_Text_As_Reference;
# }
# "),engine = "neato")
# c<- grVizOutput(workflow)

My caption.

The example shows the process workflow

Previously have been mentioned here by applying SQL and condition rules here

Text Mining and Category Identification

During my last proposal defense , i proposed to develop the dictionary vector as the enrichment source for the trouble ticket prediction and also to identify the category of the messages.

My caption.

I have done some simple code in R to demonstrate this :-

df_tm <- data.frame(df_tm$description , stringsAsFactors = TRUE)
mycorpus <- Corpus(DataframeSource(df_tm))
# remove all unrelated char/text
tdm <- TermDocumentMatrix(mycorpus, control = list(weight = weightTfIdf  ,
removePunctuation = TRUE, stopwords = TRUE , removeNumbers = TRUE , stemming  = FALSE))
# remove sparse term
#z <- m[c("oil", "zone"),
tdm <- removeSparseTerms(tdm, sparse= 0.8 )
# Generate the dictionary vector limit by 10 docs
c <- suppressWarnings(suppressMessages(inspect(tdm[,1:10])))
## <<TermDocumentMatrix (terms: 124, documents: 10)>>
## Non-/sparse entries: 769/471
## Sparsity           : 38%
## Maximal term length: 20
## Weighting          : term frequency (tf)
## 
##                       Docs
## Terms                  1  2 3 4 5 6 7 8 9 10
##   acc                  0  0 1 0 1 0 0 0 0  0
##   account              1  1 2 1 2 4 4 2 3  1
##   acct                 1  1 1 2 1 1 1 1 1  1
##   acknowledge          0  0 0 0 0 0 0 0 1  0
##   active               1  1 1 1 1 1 1 1 1  0
##   address              3  2 3 4 2 2 2 2 2  2
##   addressing           1  1 1 1 1 1 1 1 1  0
##   advice               1  3 0 3 2 1 1 1 3  2
##   advise               0  0 0 0 0 0 0 0 3  0
##   advised              0  0 0 0 1 1 0 2 0  0
##   alive                1  0 0 0 0 0 0 0 0  1
##   already              2  1 0 0 0 0 0 0 0  0
##   also                 0  0 0 0 1 0 0 0 0  0
##   assigned             1  1 1 1 1 1 1 1 1  0
##   available            0  0 1 2 1 1 0 1 1  0
##   back                 0  2 0 0 0 0 1 0 1  0
##   bin                  0  0 0 2 0 0 1 0 2  0
##   bras                 1  1 1 2 1 1 1 1 1  1
##   bsrbatunifige        0  2 0 0 0 0 3 0 0  0
##   btu                  1  0 1 1 2 0 1 0 1  0
##   cable                1  0 1 1 1 0 1 0 0  0
##   call                 0  3 1 0 0 1 0 0 1  1
##   called               0  0 0 0 0 0 0 1 0  0
##   caller               0  1 1 1 1 1 1 1 1  1
##   calling              1  1 1 2 1 1 1 2 1  1
##   check                2  0 5 3 3 5 5 1 3  3
##   checked              0  0 0 0 0 0 0 2 0  0
##   connect              1  1 1 2 1 1 1 1 1  1
##   connection           0  1 1 2 1 0 0 2 0  1
##   contact              0  1 1 1 1 1 1 1 1  1
##   cpe                  0  0 0 0 1 0 0 0 0  0
##   create               2  2 2 1 1 1 0 2 0  0
##   created              1  1 1 1 1 1 2 1 2  0
##   ctt                  2  1 1 2 1 1 1 0 0  1
##   cust                 3  0 0 5 0 1 0 0 3  0
##   customer             2 10 2 7 7 4 2 5 7  4
##   date                 2  3 2 2 2 2 2 2 2  0
##   detail               1  0 0 1 0 0 0 0 1  1
##   details              1  1 0 2 1 1 0 0 2  0
##   dlink                1  0 1 2 1 0 1 1 1  1
##   done                 1  0 2 0 0 1 0 0 0  3
##   download             1  1 1 1 1 1 1 1 1  0
##   dsl                  1  0 0 0 1 0 0 0 1  1
##   dynamically          1  1 1 1 1 1 1 0 1  0
##   english              0  1 0 0 0 1 0 1 0  0
##   equipment            0  0 0 0 0 0 0 0 0  0
##   eth                  1  2 0 0 0 3 0 0 0  1
##   exchange             1  1 1 1 1 1 1 1 1  1
##   framed               1  1 1 2 1 1 1 1 1  1
##   full                 0  1 0 0 0 0 0 0 0  0
##   get                  0  0 0 0 0 0 0 1 0  0
##   holder               0  0 0 0 1 0 1 0 0  1
##   hsi                  2  0 2 2 1 1 1 1 3  2
##   huaweiqosprofilename 1  1 1 1 1 1 1 1 1  0
##   huaweiuserclass      1  1 1 1 1 1 1 1 1  0
##   inform               1  3 0 1 1 0 0 0 0  0
##   informed             0  0 0 0 0 0 0 1 0  0
##   internet             2  1 1 1 1 0 1 2 2  2
##   iptv                 2  0 1 1 1 0 5 1 2  1
##   jan                  0  3 0 0 0 6 4 2 0  0
##   lan                  2  0 8 0 4 3 4 8 5  5
##   language             1  1 1 0 1 1 1 1 1  1
##   light                0  0 2 2 1 0 0 0 1  0
##   login                1  1 1 1 1 1 1 1 1  0
##   mac                  2  1 2 2 1 1 1 2 1  1
##   malay                1  0 1 0 0 0 1 0 1  1
##   modem                0  0 1 1 1 0 0 1 1  0
##   modified             1  1 1 1 1 1 1 1 1  0
##   name                 1  3 3 2 2 2 3 3 3  1
##   number               2  1 2 2 2 0 2 3 3  1
##   onu                  0  0 1 0 0 0 0 1 0  0
##   orange               0  0 1 0 1 0 0 0 0  0
##   order                1  1 0 1 1 1 1 1 1  1
##   outages              0  0 0 1 1 1 1 1 1  1
##   outcome              1  1 1 0 0 1 1 0 1  1
##   owner                1  1 0 1 0 1 1 0 1  0
##   package              1  1 1 0 1 1 1 0 1  1
##   persist              2  0 0 1 1 0 0 2 1  0
##   phone                0  0 1 1 0 1 1 0 0  1
##   ping                 1  1 1 2 1 1 1 1 1  1
##   power                4  0 2 0 2 3 2 2 2  3
##   pppoe                1  1 1 1 1 1 1 0 1  0
##   preferred            0  1 1 0 0 1 1 0 0  1
##   problem              3  2 2 2 2 1 2 3 2  1
##   proceed              0  2 0 1 0 1 0 1 0  0
##   profile              1  1 1 1 1 1 1 1 1  0
##   query                2  1 1 3 1 2 1 1 2  1
##   rad                  1  1 1 1 0 1 1 1 0  0
##   raduser              1  1 1 1 1 1 0 1 1  0
##   reboot               1  0 4 2 1 0 1 1 2  3
##   result               3  1 2 6 2 3 1 2 3  1
##   scheme               1  1 1 1 1 1 1 1 1  0
##   sec                  0  2 0 0 0 3 2 1 0  0
##   service              0  0 0 0 1 0 0 0 0  1
##   services             0  0 0 0 1 1 0 0 1  1
##   session              3  4 3 5 3 4 5 3 3  4
##   sgb                  2  3 1 1 0 4 0 0 1  0
##   since                0  1 0 0 0 1 0 0 0  0
##   soc                  0  1 0 1 1 0 0 1 0  1
##   speed                1  1 1 1 1 1 1 1 1  0
##   station              1  1 1 2 1 1 1 2 1  1
##   status               2  1 3 1 2 3 2 1 2  0
##   stb                  3  0 0 0 0 0 3 0 0  0
##   still                1  1 0 2 1 0 0 0 1  0
##   streamyx             2  1 1 3 1 2 1 1 2  1
##   system               0  2 0 0 0 0 0 1 0  0
##   time                 2  4 3 4 2 2 3 2 3  2
##   total                1  1 1 2 1 1 1 1 1  1
##   troubleshooting      0  2 1 1 0 1 0 0 1  0
##   unable               0  0 0 1 0 0 0 1 0  0
##   upload               1  1 1 1 1 1 1 1 1  0
##   usage                1  1 1 1 0 1 1 1 1  0
##   user                 1  0 0 2 0 1 0 0 1  0
##   username             1  1 1 1 1 1 1 2 1  1
##   using                0  0 0 1 0 0 0 0 0  0
##   vdsl                 1  0 0 0 1 0 0 0 1  1
##   verify               0  0 0 2 2 0 0 0 1  0
##   vip                  1  1 0 0 1 1 1 0 1  0
##   vipm                 1  1 0 1 1 1 1 0 1  0
##   vobb                 2  1 3 2 2 2 2 2 2  2
##   voip                 1  0 1 0 0 0 0 1 1  1
##   will                 0  1 0 0 0 0 0 0 0  0
##   wireless             1  0 1 0 1 0 1 1 1  1
##   xpon                 0  0 1 0 0 0 3 1 0  0
# Plotting the relationship between words
plot(tdm, corThreshold = 0.2, weighting = TRUE)

So this dictionary vector can be used in the future to add on to the current dataset as the new variables.

 # disable this code due to lots of messages appears
 # transpose_tdm<- invisible(as.data.frame(inspect(t(tdm[,]))))
 # disable this code due to lots of messages appears
 # cause_code_tdm <- cbind(cbind(df$tt_row_id,df$cause_code),transpose_tdm[1:10,1:10])
 # write to CSV
 # write.csv(cause_code_tdm,"transpose_tdm.csv")
 transpose_tdm <- read.csv("transpose_tdm.csv")  
 head(transpose_tdm)
##   X trouble_ticket_id               Cause_code acc account acct
## 1 1          1-WTJASD   Clear While Localising   0       1    1
## 2 2          1-WTJ924       Customer_Behaviour   0       1    1
## 3 3          1-WTJPNC        Drop Fiber Damage   1       2    1
## 4 4          1-WTGIQE TM_CPE_PG Config Problem   0       1    2
## 5 5          1-WTI5UQ        VDSL Modem Faulty   1       2    1
## 6 6          1-WTDQ9Q    AccessNetwork_Failure   0       4    1
##   acknowledge active address addressing advice advise advised
## 1           0      1       3          1      1      0       0
## 2           0      1       2          1      3      0       0
## 3           0      1       3          1      0      0       0
## 4           0      1       4          1      3      0       0
## 5           0      1       2          1      2      0       1
## 6           0      1       2          1      1      0       1

We need to discuss this further.


Journal 1 - Paper

Still in progress.

Submission date ?

My caption. My caption. My caption. My caption. My caption.


Journal 2 - Paper

Still in progress. Need to discuss.

Submission date ?

My caption.

My caption.

My caption.


Journal 3 - Paper

Still in progress. Need to discuss.

Submission date ?

My caption.

My caption.

My caption.

My caption.